Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages
نویسندگان
چکیده
This paper shows experimental results concerning automatic enrichment of the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach. The approach is language independent as reinforced by experiments performed on Portuguese and Spanish Broadcast News corpora. The discriminative models are trained for a language using spoken and written corpora from that language. This paper provides the first results on Spanish Broadcast News data and the first comparative study between Portuguese and Spanish, on this subject.
منابع مشابه
Recovering Capitalization and Punctuation Marks on Speech Transcriptions
This work addresses two metadata annotation tasks, involved in the production of rich transcripts: automatic capitalization, and punctuation marks recovery. The main focus concerns broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared, and results support the ideia that generative approaches capture the structure of writte...
متن کاملRecovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news
The following material presents a study about recovering punctuation marks, and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using: finite state transducers automatically built from language models; and maximum entropy models. Several resources were used, includi...
متن کاملCommas Recovery with Syntactic Features in French and in Czech
Automatic speech transcripts can be made more readable and useful for further processing by enriching them with punctuation marks and other meta-linguistic information. We study in this work how to improve automatic recovery of one of the most difficult punctuation marks, commas, in French and in Czech. We show that commas detection performances are largely improved in both languages by integra...
متن کاملA Neural Network Architecture for Multilingual Punctuation Generation
Even syntactically correct sentences are perceived as awkward if they do not contain correct punctuation. Still, the problem of automatic generation of punctuation marks has been largely neglected for a long time. We present a novel model that introduces punctuation marks into raw text material with transition-based algorithm using LSTMs. Unlike the state-of-the-art approaches, our model is lan...
متن کاملSpeech recognition with automatic punctuation
We present a method of speech recognition with automatic punctuation based on a combination of acoustic and lexical evidence. In the recognizer vocabulary, punctuation marks are treated as word entries. By assigning the acoustic baseforms of silence, breath, and other non-speech sounds to punctuation marks, and using a properly processed N-gram language model, unpronounced punctuation marks of ...
متن کامل